Add UBB power and power_limit fields to npm_info for MI350X baseboard power monitoring (SWDEV-567812)#3262

Open
koushikbillakanti-amd wants to merge 1 commit into develop from feature/SWDEV-567812-ubb-power-support

@koushikbillakanti-amd
Contributor

Motivation

Add support for UBB (baseboard) power monitoring on MI350X, as required for GPU workload health monitoring on large clusters. This enables reading real-time baseboard power consumption and power limit thresholds via AMDSMI.

Technical Details

Extended rsmi_npm_info_t and amdsmi_npm_info_t structures with ubb_power and ubb_power_limit fields (ABI-compatible using reserved space). Added get_ubb_power() and get_ubb_power_limit() functions in rocm_smi_npm.cc to read from sysfs baseboard_power and baseboard_power_limit files. Updated Python wrapper and CLI to expose new fields.
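
To illustrate the "ABI-compatible using reserved space" point, here is a minimal sketch of the pattern. The struct names `npm_info_v1` and `npm_info_v2`, the `status` and `limit` fields, and the size of the reserved array are assumptions for illustration; they are not the actual `rsmi_npm_info_t` or `amdsmi_npm_info_t` definitions. Only `ubb_power` and `ubb_power_limit` come from this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical "before" layout. Field names and the reserved size are
// assumptions, not the real rsmi_npm_info_t definition.
struct npm_info_v1 {
  uint32_t status;
  uint64_t limit;
  uint64_t reserved[6];  // spare space set aside for future fields
};

// Hypothetical "after" layout: the two new fields take the first two
// reserved slots, and the reserved array shrinks by the same amount.
struct npm_info_v2 {
  uint32_t status;
  uint64_t limit;
  uint64_t ubb_power;
  uint64_t ubb_power_limit;
  uint64_t reserved[4];
};

// Compile-time checks that the extension did not change the binary layout
// seen by callers built against the old definition.
static_assert(sizeof(npm_info_v1) == sizeof(npm_info_v2),
              "struct size must not change");
static_assert(offsetof(npm_info_v1, limit) == offsetof(npm_info_v2, limit),
              "existing fields must keep their offsets");
static_assert(offsetof(npm_info_v2, ubb_power) ==
                  offsetof(npm_info_v1, reserved),
              "new fields must start where the reserved space began");
```

Because the new fields are carved out of the front of the reserved array, the struct size and the offsets of existing fields stay the same, so a caller compiled against the older header still passes a buffer of the right size.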

JIRA ID

SWDEV-567812

Test Plan

Built and tested on MI350X hardware with live UBB sysfs files. Verified Python API returns correct values via amdsmi_get_npm_info(). Confirmed graceful NOT_SUPPORTED handling for processors without UBB capability.

Test Result

Successfully read UBB power (2127 W) and UBB power limit (8400 W) from MI350X card33. The build passes with no warnings. All 8 processors were handled correctly, with proper error handling for unsupported devices.

Submission Checklist

}


rsmi_status_t get_ubb_power(const std::string &board_path, uint64_t *power) {

We seem to have some duplicated work when looking at:

  • get_ubb_power()
  • get_ubb_power_limit()

The duplicated part could be extracted into a helper function.
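
One way the extraction could look, as a rough sketch rather than the actual code in rocm_smi_npm.cc: `read_board_u64` is a hypothetical helper name, the `Status` enum is a stand-in for `rsmi_status_t` so the snippet compiles on its own, and the error mapping is an assumption. The sysfs file names are the ones given in the PR description.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Stand-in for rsmi_status_t (assumption, so the sketch is self-contained).
enum class Status { kSuccess, kInvalidArgs, kNotSupported, kUnexpectedData };

// Hypothetical shared helper: reads one unsigned integer from
// <board_path>/<file_name>.
static Status read_board_u64(const std::string &board_path,
                             const std::string &file_name, uint64_t *value) {
  if (value == nullptr) {
    return Status::kInvalidArgs;
  }
  std::ifstream file(board_path + "/" + file_name);
  if (!file.is_open()) {
    // Assumed mapping: a missing file means the board has no UBB support.
    return Status::kNotSupported;
  }
  uint64_t parsed = 0;
  if (!(file >> parsed)) {
    return Status::kUnexpectedData;
  }
  *value = parsed;
  return Status::kSuccess;
}

// Both public functions become thin wrappers around the helper.
Status get_ubb_power(const std::string &board_path, uint64_t *power) {
  return read_board_u64(board_path, "baseboard_power", power);
}

Status get_ubb_power_limit(const std::string &board_path, uint64_t *limit) {
  return read_board_u64(board_path, "baseboard_power_limit", limit);
}
```

With this shape, any later change to the parsing or error handling only has to be made in one place.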

status = "DISABLED" if status == amdsmi_interface.amdsmi_wrapper.AMDSMI_NPM_STATUS_DISABLED else "ENABLED"
npm_dict.update({"status": status})
# Add UBB power info if available (not UINT64_MAX sentinel)
if ubb_power != "N/A" and ubb_power != 0xFFFFFFFFFFFFFFFF:

I recommend defining a named constant for 0xFFFFFFFFFFFFFFFF and using it where needed.

Contributor

@oliveiradan left a comment

@koushikbillakanti-amd,
I added a few comments for you to take a look at.

@marifamd, @gabrpham,
Please take a look at the CLI changes when you get a few minutes.

@oliveiradan requested a review from a team on February 14, 2026, 03:46